Appendix D — Assignment D
Instructions
You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.
Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.
Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command `quarto render filename.ipynb --to html`. Submit the HTML file.
The assignment is worth 100 points and is due on Friday, 19th May 2023 at 11:59 pm.
Five points are for properly formatting the assignment. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts).
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
D.1 Conceptual
D.1.1 AdaBoost vs Random Forest
Between AdaBoost and Random Forest, which model is more sensitive to outliers in the response, and why? Consider both regression and classification.
(1 + 3 points)
D.1.2 Loss functions
Which loss functions should you use in boosting algorithms to reduce sensitivity to outliers in the response, as compared to the squared error loss function, for regression problems? Name any 2 loss functions and explain how they reduce the sensitivity towards outliers.
(2 + 2 points)
D.2 Regression Problem - Miami housing
D.2.1 Data preparation
Read the data miami-housing.csv. Check the description of the variables here. Split the data into 60% train and 40% test. Use random_state = 45. The response is SALE_PRC, and the rest of the columns are predictors, except PARCELNO. Print the shape of the predictors dataframe of the train data.
(2 points)
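One way to set up this split is sketched below. Since the actual miami-housing.csv is not bundled here, a small synthetic frame stands in for it, with two hypothetical predictor columns; in the assignment, replace the stand-in with `pd.read_csv("miami-housing.csv")`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for miami-housing.csv (assumed column names for illustration);
# in the assignment, use: data = pd.read_csv("miami-housing.csv")
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)),
                    columns=["LND_SQFOOT", "TOT_LVG_AREA", "SALE_PRC"])
data["PARCELNO"] = np.arange(100)  # identifier column, not a predictor

# SALE_PRC is the response; exclude it and the PARCELNO identifier from the predictors
X = data.drop(columns=["SALE_PRC", "PARCELNO"])
y = data["SALE_PRC"]

# 60% train / 40% test split with the required seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)
print(X_train.shape)  # shape of the predictors dataframe of the train data
```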
D.2.2 AdaBoost hyperparameter tuning
Develop and tune an AdaBoost model to predict SALE_PRC based on all the predictors. Compute the MAE on test data.
You must tune in the following manner:
- Use `GridSearchCV` to minimize the 5-fold mean absolute error (MAE).
- You are advised to do a coarse grid search first to get an idea of the domain space where the optimal hyperparameter values lie. If you reach the goal with the coarse grid search, you can stop. Otherwise, you may follow it up with a finer grid search to get more precise optimal hyperparameter values.
- You may decide for yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, and `learning_rate` should suffice.
- The test MAE must be less than $46,000. You must show the optimal values of the hyperparameters obtained, and the test MAE.
Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.
Hint: Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.
1. Consider tree depths of 3, 5, and 10; number of trees as 10, 50, 100, and 200; and learning rates as 0.0001, 0.001, 0.01, 0.1, and 1.0. `GridSearchCV` takes 2 minutes to execute on a 6-core laptop for these values.
2. With the above search, you will probably fail to achieve the objective. However, when you visualize the 5-fold MAE with each of the hyperparameter values considered, you will realize that there is a particular hyperparameter for which you should consider higher / lower values. You will also realize that you need not consider some of the values of the remaining hyperparameters.
3. Do another 2-minute grid search based on what you realized in (2), and you should achieve the objective.
(10 points)
D.2.3 AdaBoost feature importance
Arrange and print the predictors in decreasing order of importance.
(2 points)
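A minimal sketch of the sorting step, using an untuned model on toy data with hypothetical predictor names; in the assignment, use your tuned model's `feature_importances_` with the real predictor column names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Toy data; replace with the tuned model and the actual predictor names
X, y = make_regression(n_samples=200, n_features=4, random_state=45)
cols = [f"x{i}" for i in range(4)]  # hypothetical predictor names
model = AdaBoostRegressor(random_state=45).fit(X, y)

# Pair importances with predictor names and sort in decreasing order
importances = pd.Series(model.feature_importances_,
                        index=cols).sort_values(ascending=False)
print(importances)
```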
D.2.4 Huber loss
What is the advantage of the Huber loss function (page 349 of Elements of Statistical Learning) over the squared error and absolute error loss functions?
(4 points)
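For reference, the Huber loss with threshold $\delta$ is quadratic for small residuals and linear for large ones; the form below follows the definition given in Elements of Statistical Learning:

$$
L(y, f(x)) =
\begin{cases}
[\,y - f(x)\,]^2, & |y - f(x)| \le \delta, \\
2\delta\,|y - f(x)| - \delta^2, & \text{otherwise.}
\end{cases}
$$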
D.2.5 RandomizedSearchCV vs GridSearchCV
What’s the advantage of GridSearchCV over RandomizedSearchCV and vice-versa? When will GridSearchCV be preferred over RandomizedSearchCV and vice-versa?
(4 points)
D.2.6 Gradient boosting (Huber loss) hyperparameter tuning
Develop and tune a Gradient boosting model with Huber loss to predict SALE_PRC based on all the predictors. Compute the MAE on test data.
You must tune in the following manner:
- You may use `GridSearchCV` or `RandomizedSearchCV` to minimize the K-fold mean absolute error (MAE). You may choose any K.
- You may decide for yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, `learning_rate`, and `subsample` should suffice.
- The test MAE must be less than $43,000. You must show the optimal values of the hyperparameters obtained, and the test MAE.
Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.
Hint: Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.
1. Use 2-fold cross-validation to speed up execution. Here, we are compromising: accepting some bias in the CV error in exchange for a shorter execution time.
2. In gradient boosting, the suggested depth of trees is in [4, 8] (see page 363 in Elements of Statistical Learning). So, consider depths of 4, 6, and 8. Consider 3 values of the number of trees in [100, 1000], 3 values of the learning rate in [0.01, 0.5], and 3 subsample values in [0.5, 1]. It takes 8 minutes on a 6-core laptop to do an exhaustive search on these values.
3. With the above search, you will probably fail to achieve the objective. However, when you compare the optimal hyperparameter values obtained with the hyperparameter values considered, you will realize that there are some hyperparameters for which you should consider higher / lower values.
4. Do another 10-minute grid search based on what you realized in (3), and you should achieve the objective.

Further fine-tuning may reduce your MAE to around $42,400. However, you can stop once it is below $43,000 in (4).
(14 points)
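The hint's search can be sketched as below, again on toy stand-in data; `loss="huber"` switches `GradientBoostingRegressor` from squared error to the Huber loss, and `RandomizedSearchCV` with a small `n_iter` keeps the run short (the grids mirror the hint's ranges, not the values that reach the $43,000 target).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Toy stand-in for the Miami housing data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)

# Grids following the ranges suggested in the hint
param_distributions = {
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.1, 0.5],
    "subsample": [0.5, 0.75, 1.0],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(loss="huber", random_state=45),
    param_distributions, n_iter=5, cv=2,  # 2-fold CV, as in the hint
    scoring="neg_mean_absolute_error", random_state=45, n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(mean_absolute_error(y_test, search.best_estimator_.predict(X_test)))
```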
D.2.7 Gradient boosting feature importance
Arrange and print the predictors in decreasing order of importance.
(2 points)
D.2.8 Bias-variance
For each of the following hyperparameters tuned in the previous question, explain how they affect the bias / variance of a gradient boosting model when their values are increased.
D.2.8.1 max_depth
D.2.8.2 n_estimators
D.2.8.3 learning_rate
D.2.8.4 subsample
(8 points)
D.2.9 XGBoost objective function
How is XGBoost different from the gradient boosting performed with the GradientBoostingRegressor() function in the previous question with regard to the optimization objective? How does this difference benefit XGBoost's prediction accuracy?
(4 points)
D.2.10 XGBoost hyperparameter tuning
Develop and tune an XGBoost model to predict SALE_PRC based on all the predictors. Compute the MAE on test data.
You must tune in the following manner:
- You may use `GridSearchCV` or `RandomizedSearchCV` to minimize the K-fold mean absolute error (MAE). You may choose any K.
- You may decide for yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, `learning_rate`, `reg_lambda`, `gamma`, and `subsample` should suffice.
- The test MAE must be less than $42,000. You must show the optimal values of the hyperparameters obtained, and the test MAE.
Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.
Hint: Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.
1. Inspired by the optimal hyperparameter values obtained in D.2.6, do a search with 2-fold cross-validation. Even though the default loss function in XGBoost is `squarederror`, hyperparameter values similar to the optimal hyperparameter values obtained in D.2.6 seem to work well. The regularization parameters `gamma` and `reg_lambda` help reduce the MAE further.
2. It took 10 minutes on a 6-core laptop to tune the model with (1), with the values of `gamma` considered as 0 and 10, and the values of `reg_lambda` considered as 0, 1, and 10.
(14 points)
D.2.11 XGBoost Feature importance
Arrange and print the predictors in decreasing order of importance.
(2 points)
D.3 Classification - Term deposit
The data for this question is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were called and asked to subscribe to a term deposit.
There is a training dataset, train.csv, which you will use to develop a model, and a test dataset, test.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:
- `age`: Age of the client
- `education`: Education level of the client
- `day`: Day of the month the call is made
- `month`: Month of the call
- `y`: Did the client subscribe to a term deposit?
- `duration`: Call duration, in seconds. This attribute highly affects the output target (e.g., if `duration` = 0 then `y` = 'no'). Yet, the duration is not known before a call is performed. Also, after the end of the call, `y` is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.
(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)
D.3.1 Data preparation
Convert all the categorical predictors in the data to dummy variables. Note that `month` and `education` are categorical variables.
(1 point)
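The conversion can be done with `pd.get_dummies`. The sketch below uses a tiny stand-in frame with the assignment's column names; in the assignment, apply the same call to both train.csv and test.csv (and make sure the resulting columns match between the two).

```python
import pandas as pd

# Tiny stand-in for train.csv / test.csv
df = pd.DataFrame({
    "age": [30, 41, 25],
    "education": ["primary", "secondary", "tertiary"],
    "day": [5, 12, 20],
    "month": ["jan", "feb", "jan"],
})

# Convert the categorical columns to 0/1 indicator (dummy) columns
df_dummies = pd.get_dummies(df, columns=["education", "month"])
print(df_dummies.columns.tolist())
```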
D.3.2 Boosting
Develop and tune any boosting model to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:
- A minimum overall classification accuracy of 70% among the classification accuracies on train.csv and test.csv.
- A minimum recall of 65% among the recalls on train.csv and test.csv.

Print the accuracy and recall for both datasets: train.csv and test.csv.
Note that:
i. You cannot use `duration` as a predictor. The predictor is not useful for prediction because its value is determined only after the marketing call ends; by then, we already know whether the client responded positively or negatively.
ii. You are free to choose any value of the threshold probability for classifying observations. However, you must use the same threshold on both datasets.
iii. Use cross-validation on the train data to optimize the model hyperparameters.
iv. Using the optimal model hyperparameters obtained in (iii), develop the boosting model. Plot the cross-validated accuracy and recall against the decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot, to achieve the required trade-off between recall and accuracy.
v. Evaluate the accuracy and recall of the developed model with the tuned decision threshold probability on both datasets. Note that the test dataset must only be used to evaluate performance metrics, and not to optimize any hyperparameters or the decision threshold probability.
(20 points - 10 points for tuning the hyperparameters, 4 points for making the plot, 4 points for tuning the decision threshold probability based on the plot, and 2 points for printing the accuracy & recall on both the datasets)
It is up to you to pick the hyperparameters and their values in the grid.
Hint: Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.
XGBoost may help. Tuning `n_estimators`, `max_depth`, `learning_rate`, `gamma`, `reg_lambda`, and `subsample` should suffice. You may take the recommended value of `scale_pos_weight`. Use `RandomizedSearchCV`. Evaluating 200 models with 5-fold cross-validation, i.e., 1,000 fits, takes 45 minutes on a 6-core laptop. You may try 2-fold cross-validation to reduce the time.